Chosen Parameters

This section allows the user to set parameters that are specific to their dataset (e.g., file path to count matrix, sample and group names, desired pairwise group comparisons, etc.). This is the only section of the script that requires any user input/modification. If the script produces errors down the line, it is most likely due to incorrect parameter specifications!

## Specify file path to count matrix input table. Assumed to be a tab-separated count matrix table containing AT LEAST an additional column for ENSEMBL gene ids (see parameter "unique_id" below)
filepath = "./DemoDataset.txt"


## Specify unique gene id column name containing ENSEMBL gene ids (e.g. "ENSMUSG00000051951"). 
unique_id = "gene_id"


## Specify gene name column containing gene symbols (e.g. "Xkr4") if present. If not present, set this parameter to NULL (i.e. gene_name = NULL), and gene names will be extracted based on ENSEMBL gene ids via biomaRt
gene_name = NULL


## Specifiy the column name that indicates whether a gene is protein coding or not, if present. If not present, set this parameter is set to NULL (i.e. biotype_name = NULL), and biotypes will be extracted based on ENSEMBL gene ids via biomaRt
biotype_name = NULL


## Specify count columns regular expression pattern. This regular expression pattern should match the read count column headers. You can test if your pattern is specified correctly via grep(names(df), pattern=count_column_pattern, value=TRUE), which should return the count column headers.
count_column_pattern = "_S[0-9]*$"


## Specify sample names character vector in the order the samples are listed in the count table (and matched by count_column_pattern)
samplenames = c("KO_effector_10", "KO_effector_11", "KO_effector_12", 
                "WT_naive_1", "WT_naive_2", "WT_naive_3", 
                "WT_effector_4", "WT_effector_5", "WT_effector_6", 
                "KO_naive_7", "KO_naive_8", "KO_naive_9")


## Specify group names character vector in the order the samples are listed in the count table (and matched by count_column_pattern). This information is important for differential expression testing.
groups = rep(c("KO_effector","WT_naive","WT_effector", "KO_naive"), each=3)


## Optional: Specify batches as factor in the order the samples are listed in the count table (if present. Else: set to NULL). Batches will be considered in differential expression testing
batch = NULL
  

## Optional: Specify sample reordering indices as numerical vector. This parameter reorders samples for visualization purposes, e.g. clustering gene progiles. If not desired, set this parameter to NULL
reorder = c(4,5,6,10,11,12,7,8,9,1,2,3)


## Specify minimum number read counts per gene across all samples. Genes (i.e. rows) with total count below the threshold will be filtered out.
min_counts_threshold_per_row = 50


## Specify minimum number of read counts per cell entry. Genes (i.e. rows) with any entry below the threshold will be filtered out. This filter is a bit strict, but it can help to stabilize within-group gene variances (-> important for the applicability of DESeq's FC shrinkage estimation). This is particularly useful when the data contains a lot of missing value entries ("drop-out events"), which are more common in low input RNA sequencing experiments. Check out the scripts output (more specifically, the volcano plot fold changes) for varying thresholds to see if this filtering improves the data! If you want to be permissive with this filtering (recommended for bulk RNAseq experiments with a lot of cells), just set this filter to 0.
min_counts_threshold_per_cell = 1


## Define number of k-means clusters for clustering 
k_clusters = 8


## Specify if log2 counts should be standardized (i.e. "Z-scored") within each gene for the k-means clustering. If set to FALSE, log2(x+1)-transformed counts will instead "only" be centered to an average of 0, meaning that gene-variances are not fixed at 1.
standardize_before_clustering = TRUE


## Specify if genes should be additionally filtered for overall significance (H0: no difference between groups) before k-means clustering. Recommended if standardize_before_clustering = TRUE
ANOVA_filter_before_clustering = TRUE


## Specify at which ANOVA p-value cutoff the ANOVA_filter_for_clustering should be applied. Gnenes that do not pass this treshold (i.e. have a lower p-value) will not be used in the k-means clustering
adj.pval_cutoff = 0.01


## Specify if overrepresentation analysis (ORA) enrichment analys susing gProfiler on k-means clusters should be performed; and if gene set enrichment analysis (GSEA) on differential expression testing results should be performed.
enrichment_analysis = TRUE


## For ORA enrichment analysis of clusters via gprofiler, deifine if a custom background should be used (from a statistical standpoint, I recommend this as it prevents bias). The custom background is defined as all genes quantified in the experiment.
custom_background = TRUE


## Define organism for ORA and GSEA enrichment analysis (note: specify "mmusculus" for mouse, and "hsapiens" for human). The script currently only supports mouse and human; but it can be manually adapted to any handle any other organism.
organism = "mmusculus"


## Specify biomaRt mirror. Sometimes the chosen mirror (even the default option) produces errors. You can set this parameter to "useast", "asia" and "www".
biomart_mirror = "useast"


## Specify which pairwise group comparisons should be made using DESeq2. This parameter should be specified as a list of two-element character vectors which contains groups to be tested against each other. 
pairwise_comp = list(c("KO_naive", "WT_naive"),
                     c("WT_effector","WT_naive"),
                     c("KO_effector", "WT_effector"))


## Specify genes of special interest via gene name (i.e. gene symbol). These genes will be highlighted extra in some of the figures.
genes_of_special_interest = c("Il2", "Il4", "Bcl6", "Gata3", "Rin1", "Rin3", "Rin2", "Cxcr5", "Ifng", "Stat4", "Eif5b", "Foxo1", "Sox12", "Ncor2","Jun", "Rinl", "Fbxo27", "Fbxo17")


## Specify whether final data table (containg k-means cluster information and DESeq2 results) should be exported as txt-file
export_table = TRUE





Overview and Filtering

## Overview of data:
## Number of genes before filtering for at least 50 counts per gene across all samples: 52188
## Number of genes after filtering for at least 50 counts per gene across all samples: 16756
## Number of genes before filtering for at least 1 count(s) per entry: 16756
## Number of genes after filtering for at least 1 count(s) per entry: 15815





Between-Sample Data Normalization

Between-sample normalization is performed using DESeq’s estimateSizeFactors function which estimates sample-wise scaling factors that best align each sample to each other sample. First, the unnormalized input data is visualized:

## Barplot and boxplot for unnormalized data:

Next, the normalized data (to be used in all subsequent analysis steps) is visualized:

## Barplot and boxplot after between-sample normalization:

## Pairwise scatterplots of log2-transformed counts (normalized):

If normalized properly, we expect most genes to fall around the dashed line, especially for samples within the same sample group. Note: If there are more than 7 samples in the data, this plot only visualizes the first 7 samples after normalization.





General Data Visualization


This section shows PCA and Heatmaps of the normalized and log2(x+1)-transformed data. Note that the data is not batch-corrected at this point, meaning that potential batch-effects would therefore be visible here. Also, the PCA plot is interactive in the html output! Hover over points to retrieve sample-specific information.

## PCA dimension reduction of log2-transformed counts (normalized):


## Heatmap of log2-transformed counts (normalized):

## Heatmap of 0-centered log2-transformed counts (normalized):

## Heatmap of standardized/Z-scored log2-transformed counts (normalized):

Note that each of the three heatmaps can highlight a different aspect of the data!





ANOVA

Here, a simple one-way ANOVA tests for gene-wise differential expression (DE) between all sample groups on the basis of between-sample-normalized and log2(x+1)-transformed gene counts (H0: All groups are equal; H1: At least one group differs from the others). If a batch effect is specified, the ANOVA automatically switches to a 2-way ANOVA and incorporates the batch variable as a covariate into the model. Note that this section is entirely skipped if there are no more than 2 groups overall, or if the minimum number of replicates per group is smaller than 3.

## plotting ANOVA p-value histogram:

The histogram bar on the very left reflects p-values < 0.05. Note that per definition, p-values are uniformly distributed if the null hypothesis is true (this applies to any statistical test if the required assumptions for the test are also met). Hence, looking at p-value histograms can be quite informative!





k-Means Clustering: Overview

In this section, k-means clustering is performed on between-sample-normalized and log2(x+1)-transformed gene-wise transcript abundance profiles to find gene co-expression clusters. Depending on how the parameter “standardize_before_clustering” is specified, the gene-wise quantitative data will be centered at mean 0 (if set to FALSE), or standardized/Z-scored (if set to TRUE) prior to k-means clustering.

If centered (i.e., standardize_before_clustering = FALSE), differences in the y-axis range still represent observed log2 fold change differences between samples.

If standardized/Z-scored (i.e., standardize_before_clustering = TRUE), log2-transformed counts are also normalized to unit variance within each row (i.e. row-wise variance reaches 1). A drawback of this transformation is that any subsequent clustering will also pick up common noise patterns of non-DE genes (e.g. batch effects, etc.), since variances of non-DE genes will be equally scaled up to 1, thereby amplifying “noise”. I therefore recommend first filtering for genes that show significant (adj. pval < 0.05) differential expression between groups via ANOVA-derived p-values, which can be accomplished by specifying the parameter “ANOVA_filter_before_clustering” and “adj.pval_cutoff” accordingly. For this analysis, the chosen parameters are:

print(k_clusters)
## [1] 8
print(standardize_before_clustering)
## [1] TRUE
print(ANOVA_filter_before_clustering)
## [1] TRUE
print(adj.pval_cutoff)
## [1] 0.01


The following plot (often referred to as “Ellbow-plot”) can help to find a reasonable/suitable total number of clusters:

## Residual sum of squares for different k:


Let’s investigate the k-means clustering results:

## Number of genes per cluster:

## Cluster 1:

## Cluster 2:

## Cluster 3:

## Cluster 4:

## Cluster 5:

## Cluster 6:

## Cluster 7:

## Cluster 8:

For each k-means cluster, the feature plots above visualize gene-wise transcript abundance profiles as dashed black lines. Profiles of genes specified as “genes of special interest” (see parameter section) are additionally highlighted in color.

For further exploration and annotation of gene expression clusters, lists of the genes in each cluster are stored in the folder “clusters” as simple text files. These files can be used as direct input to the STRING webtool (https://string-db.org) for network visualization.

Please note that k-means clustering as performed here will NOT ensure that all genes fit well to their respective cluster. To remedy that, one could remove genes from clusters based on cluster membership (e.g., pearson correlation to cluster center).

Further, note that there are clustering algorithms that might work better for your data (e.g. correlation clustering using the R package “WGCNA”). However, more complex clustering algorithms might require the specification of many different parameters for optimal results, which in turn complicates automatization (compare with k-means clustering performed here were just a single parameter k needs to be specified). Ultimately, treat the k-means clustering results here with some reservation - they might not be “optimal” but still they facilitate quick and easy data exploration!




k-Means Clustering: Overrepresentation Analysis (ORA)

In this section, overrepresentation analysis (ORA) is performed for all protein-coding genes within the co-expression clusters inferred above using the gprofiler R package. If the parameter custom_background is set to TRUE (recommended), all quantified protein-coding genes of experiment are defined as the statistical background. If custom_background is set to FALSE, ORA takes the entire annotated genome of the specified organism as a background (which can introduce bias). g:Profiler tests for overrepresentation of specific annotated gene sets via Fisher’s Exact Test (FET). These gene sets come from several public databases like GO (Gene Ontology), REAC (Reactome), and WP (WikiPathways).

In the html report, the ORA enrichment analysis returns no visual output for a cluster if the cluster contains more than 1/3 of the total number of quantified protein-coding genes in the experiment (this setting is intentional in order to limit processing time - the bigger the query list, the more time the calculation takes; and usually such big clusters don’t feature any enrichment compared to the background). There is also no visual output when the ORA does not return any statistically significant hits. Moreover, gene sets with a term size over 3000 are not displayed (they are often too generic to be very informative), and neither are enrichment terms that produce an intersection set consisting of only a single gene.

The output plots are reminiscent of Manhatten-plots. The x-axis corresponds to functional terms (i.e. the gene sets) which are grouped and color-coded according to the different database sources of origin. The y-axis displays the significance of gene sets tested for enrichment (i.e., FDR-adjusted p-values). The point size corresponds to the number of genes in the respective gene set.

Note that the ORA results for each cluster are also saved in table format in the folder “ORA”.

The parameters for this analysis were chosen as:

print(enrichment_analysis)
## [1] TRUE
print(custom_background)
## [1] TRUE
print(organism)
## [1] "mmusculus"
print(pval_threshold_enrichment)
## [1] 0.05
print(sources_enrichment)
## [1] "GO"   "REAC" "WP"





DE Testing and GSEA

Differential expression (DE) testing between groups is performed using the DESeq2 R package. This package models RNA read counts as negative binomially distributed count data within a generalized linear model setting, and allows for pairwise group comparisons. Batch effects (if specified in the parameter section) will be included as fixed effects into the model (which essentially “regresses out” their influence). The groups to be tested against each other are specified by the parameter “pairwise_comp”:

print(pairwise_comp)
## [[1]]
## [1] "KO_naive" "WT_naive"
## 
## [[2]]
## [1] "WT_effector" "WT_naive"   
## 
## [[3]]
## [1] "KO_effector" "WT_effector"

The section below sequentially goes over each specified pairwise group comparison and visualizes p-value histograms and volcano plots. Note that fold change shrinkage estimation has been performed, which reduces absolute values of fold changes for low-abundance (i.e. high variance genes) and is part of DESeq2’s default workflow. DE testing results are added as extra columns to the dataframe which is ultimately exported (see final section).

In addition, for each specified pairwise comparison the gene-wise fold changes are used to conduct gene set enrichment analysis (GSEA) with gene sets from the gene ontology and hallmark databases (both retrieved from MSigDB via the msigdbr R package). For performing and visualizing GSEA results, the R packages ClusterProfiler and enrichplot are employed. GSEA result tables are stored in the folder “GSEA”.

## Pairwise Group Comparison 1:
## [1] "KO_naive" "WT_naive"
## Results:
## 
## out of 15815 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 162, 1%
## LFC < 0 (down)     : 89, 0.56%
## outliers [1]       : 0, 0%
## low counts [2]     : 1840, 12%
## (mean count < 14)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## 
## 
## Histogram of p-values and adjusted p-values:

## Volcano Plot:

## Performing Gene Set Enrichment Analysis (GSEA) based on fold change estimates.
## Total number of significantly enriched gene sets (adj.pval < 0.05): 14
## Among them...
## number of upregulated gene sets (NES>0): 14
## number of downregulated gene sets (NES<0): 0
## Upgregulated means that genes have predominantly positive log2 fold changes, thus NES>0.
## Downregulated means that genes have predominantly negative log2 fold changes, thus NES<0.
## 
## Dotplot of top upregulated gene sets:

## Enrichment map of top upregulated gene sets:

## Dotplot of top downregulated gene sets:
## Enrichment map of top downregulated gene sets:

## Pairwise Group Comparison 2:
## [1] "WT_effector" "WT_naive"   
## Results:
## 
## out of 15815 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 4001, 25%
## LFC < 0 (down)     : 3713, 23%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## 
## 
## Histogram of p-values and adjusted p-values:

## Volcano Plot:

## Performing Gene Set Enrichment Analysis (GSEA) based on fold change estimates.
## Total number of significantly enriched gene sets (adj.pval < 0.05): 303
## Among them...
## number of upregulated gene sets (NES>0): 279
## number of downregulated gene sets (NES<0): 24
## Upgregulated means that genes have predominantly positive log2 fold changes, thus NES>0.
## Downregulated means that genes have predominantly negative log2 fold changes, thus NES<0.
## 
## Dotplot of top upregulated gene sets:

## Enrichment map of top upregulated gene sets:

## Dotplot of top downregulated gene sets:

## Enrichment map of top downregulated gene sets:

## Pairwise Group Comparison 3:
## [1] "KO_effector" "WT_effector"
## Results:
## 
## out of 15815 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 251, 1.6%
## LFC < 0 (down)     : 270, 1.7%
## outliers [1]       : 0, 0%
## low counts [2]     : 2453, 16%
## (mean count < 20)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## 
## 
## Histogram of p-values and adjusted p-values:

## Volcano Plot:

## Performing Gene Set Enrichment Analysis (GSEA) based on fold change estimates.
## Total number of significantly enriched gene sets (adj.pval < 0.05): 9
## Among them...
## number of upregulated gene sets (NES>0): 9
## number of downregulated gene sets (NES<0): 0
## Upgregulated means that genes have predominantly positive log2 fold changes, thus NES>0.
## Downregulated means that genes have predominantly negative log2 fold changes, thus NES<0.
## 
## Dotplot of top upregulated gene sets:

## Enrichment map of top upregulated gene sets:

## Dotplot of top downregulated gene sets:
## Enrichment map of top downregulated gene sets:

Export Table

Finally, the filtered and normalized data is exported and named “Table_Export_[CURRENT DATE].txt”. The exported table contains all DE testing results + k-means cluster information as additional columns.

While generating the html report, all created figures were stored in the folder “figures” and are available in both png and pdf file format. Enrichment testing result tables (ORA and GSEA) were stored in the folders “ORA” and “GSEA”, respectively.

## Generated output txt file file called:
## Table_Export_2024-03-30.txt
## (15815 rows, 47 columns)